Design and Implementation of the Wrapper Generation System for Web-based Information Extraction

Name

Shao-Chen Lui (呂紹誠)

Department

Graduate Institute of Computer Science and Information Engineering

Thesis Title

Design and Implementation of the Wrapper Generation System for Web-based Information Extraction
(網際網路半結構性資料擷取系統之設計與實作)

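The author has agreed to make this electronic thesis openly available immediately.
The open-access full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China: do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.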

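Abstract (Chinese)

The rapid growth of the Internet has changed how people handle information in daily life, and more and more data is presented on the WWW as HTML documents. If information from different websites can be collected and analyzed, it can be used more effectively; this is what is called "information integration". An information integration system must access its data sources through wrappers that sit between the system itself and each source. To cope with the differences among data sources, these wrappers are usually written by hand according to the characteristics of each individual source. Websites are updated frequently, however, so hand-written wrappers take a great deal of labor and time to maintain and update. Many researchers are therefore actively developing feasible methods for building tools that can construct wrappers automatically.
Previous research on automatic wrapper generation has relied mainly on wrapper induction to produce extraction rules, as in WIEN, STALKER, and SoftMealy. These systems perform very well, but their drawback is that the user must first label the data on sample Web pages before the program can analyze them and derive extraction rules. In this thesis we present a method that analyzes Web pages automatically to produce extraction rules. Our system, IEPAD (Information Extraction based on Pattern Discovery), uses this automatic page analysis so that a wrapper for a website can be built simply and quickly. IEPAD consists of three parts: a rule generator, a rule viewer, and an extractor. The rule generator applies repeated-pattern mining and multiple sequence alignment to automatically produce rules that extract each record. The user then selects a rule with the rule viewer, which performs a multi-level analysis and presents the results so that the user can tick the attributes needed, thereby producing the extraction rule. Finally, the extractor uses this rule to extract the attribute data within each record. In experiments on 14 well-known search sites, IEPAD achieves a high extraction rate of 97%.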
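The rule generator's idea of encoding a page as tokens and mining repeated patterns can be illustrated with a minimal Python sketch. It is not the thesis's implementation: brute-force n-gram counting stands in for the PAT tree, the ranking heuristic (most occurrences, then longest) is an assumption made for illustration, and the names `TagTokenizer`, `tokenize`, `repeated_patterns`, and `best_record_pattern` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Encode a page as tokens: one token per tag, TEXT for each text run."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.tokens.append("TEXT")


def tokenize(html):
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def repeated_patterns(tokens, min_len=3, max_len=20, min_count=2):
    """Map each token n-gram that occurs at least min_count times to its start positions.

    Brute force (assumption): the thesis uses a PAT tree for this step.
    """
    positions = defaultdict(list)
    for n in range(min_len, min(max_len, len(tokens)) + 1):
        for i in range(len(tokens) - n + 1):
            positions[tuple(tokens[i:i + n])].append(i)
    return {p: pos for p, pos in positions.items() if len(pos) >= min_count}


def best_record_pattern(tokens):
    """Pick the pattern with the most occurrences, preferring longer ones (illustrative heuristic)."""
    candidates = repeated_patterns(tokens)
    if not candidates:
        return None
    return max(candidates, key=lambda p: (len(candidates[p]), len(p)))


page = ("<ul><li><b>A</b><i>1</i></li><li><b>B</b><i>2</i></li>"
        "<li><b>C</b><i>3</i></li></ul>")
print(best_record_pattern(tokenize(page)))
```

On this toy page the three list items share one tag structure, so the selected pattern spans exactly one record. Real pages contain noise and optional fields, which is where the thesis's rule filtering and multiple sequence alignment come in.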

Abstract (English)

Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Since building wrappers by hand is tedious and error-prone, research in this field emphasizes the automatic generation of wrappers that can extract particular information from semi-structured Web documents. Previous work, such as WIEN, STALKER, and SoftMealy, aims to learn extraction rules from users' training examples, using labeled training pages and grammar induction to generate extraction rules automatically. However, this approach still requires human intervention to provide training examples.
In this thesis, we propose IEPAD, a system that automatically discovers extraction rules from Web pages. IEPAD includes three components: an extraction rule generator, which accepts an input Web page; a graphical user interface, called the rule viewer, which shows the record patterns discovered; and an extractor module, which extracts the desired information from similar Web pages according to the extraction rule chosen by the user. The system automatically identifies record boundaries by pattern mining and multiple sequence alignment. Furthermore, attribute values can be extracted by multi-level extraction. This new approach to IE requires less human effort than other approaches and involves no content-dependent heuristics. Experimental results show that the constructed extraction rules achieve a 97 percent extraction rate on fourteen popular search engines.
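The extractor step, applying a chosen rule to a similar page, can be sketched in Python as follows. This is only an illustration under stated assumptions: the rule is taken to be a flat sequence of tag tokens with `TEXT` slots, matching is exact rather than alignment-based, and the names `PairTokenizer`, `extract_records`, and the `wanted` parameter are hypothetical, not part of IEPAD.

```python
from html.parser import HTMLParser


class PairTokenizer(HTMLParser):
    """Collect (token, text) pairs; text is None for tag tokens."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        self.pairs.append((f"<{tag}>", None))

    def handle_endtag(self, tag):
        self.pairs.append((f"</{tag}>", None))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.pairs.append(("TEXT", data.strip()))


def extract_records(html, pattern, wanted=None):
    """Return the TEXT-slot values of each non-overlapping match of `pattern`.

    `wanted` (hypothetical) lists the slot indices the user ticked;
    None keeps every slot.
    """
    pattern = list(pattern)
    if not pattern:
        return []
    parser = PairTokenizer()
    parser.feed(html)
    parser.close()
    pairs = parser.pairs
    kinds = [kind for kind, _ in pairs]
    records, i = [], 0
    while i + len(pattern) <= len(kinds):
        if kinds[i:i + len(pattern)] == pattern:
            window = pairs[i:i + len(pattern)]
            texts = [text for kind, text in window if kind == "TEXT"]
            if wanted is not None:
                texts = [texts[k] for k in wanted]
            records.append(texts)
            i += len(pattern)  # continue after this record
        else:
            i += 1
    return records


rule = ["<li>", "<b>", "TEXT", "</b>", "<i>", "TEXT", "</i>", "</li>"]
page = "<ul><li><b>Alpha</b><i>10</i></li><li><b>Beta</b><i>20</i></li></ul>"
print(extract_records(page, rule))
print(extract_records(page, rule, wanted=[0]))
```

Exact matching keeps the sketch short, but it fails on records with missing or extra tags. Tolerating that variation is the role the abstract assigns to multiple sequence alignment.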

Keywords (Chinese)

★ Wrapper
★ Semi-structured data
★ Web page extraction
★ Information extraction
★ Information integration

Keywords (English)

★ IEPAD
★ Information Extraction
★ Information Integration
★ Semistructured Data
★ wrapper

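Table of Contents

Contents
List of Figures
List of Tables
List of Figures and Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Problem Analysis
1.5 Thesis Organization
Chapter 2 Related Work and Techniques
2.1 Related Work
2.1.1 WIEN
2.1.2 STALKER
2.1.3 SoftMealy
2.1.4 Embley
2.1.5 IEPAD
2.2 Related Techniques
2.2.1 PAT Tree
2.2.2 Multiple Sequence Alignment
2.3 Chapter Summary
Chapter 3 System Architecture
3.1 Rule Generator
3.1.1 Token Translation and Encoding
3.1.2 PAT Tree Construction
3.1.3 Rule Discovery
3.1.4 Rule Filtering
3.1.5 Exception Handling
3.1.6 Rule Composition
3.2 Rule Viewer
3.3 Extractor
3.3.1 Rule Matching
3.3.2 Multi-level Attribute Extraction
Chapter 4 System Overview and User Guide
4.1 Rule Generator
4.2 Rule Viewer and Extraction Interface
Chapter 5 Experimental Results
5.1 Description of the Test Data
5.2 Effect of Token Translation Scheme and Parameter Settings
5.3 Results of Testing with Extraction Rules
5.4 Discussion of Experimental Results
Chapter 6 Conclusion
References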

Advisor

Chia-Hui Chang (張嘉惠)

Approval Date

2001-7-5
